📚 Has the Importance of Education for Determining Income Increased Over Time?¶
🎯 Research Question¶
Has the importance of education in predicting income increased over time, compared to other factors such as age, gender, or occupation?
Understanding this helps assess the role of educational attainment in shaping economic outcomes, and whether its predictive power has evolved with structural changes in the labor market.
🧪 Objective¶
To build and compare several supervised regression models that predict income levels, and evaluate the relative contribution of education over time. I use data of 2000, 2010, 2015, and 2020.
🧰 Methodology¶
We use the following machine learning models:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- Gradient Boosting Regressor
These models are trained on a dataset containing individual-level variables, including:
ESCOLARIDAD_ACUMULADA(accumulated years of schooling / total years of education)EDAD(age)SEXO(gender)RELIGION(religion)ANIO(year of survey)- And other socio-demographic features
Interaction terms like ESCOLARIDAD_ACUMULADA × ANIO are included to examine changes in the predictive role of education over time.
📌 Outputs¶
We compare model performance using:
- ✅ Cross-validated Mean Squared Error (MSE) and Variance
- ✅ Number of non-zero coefficients (for sparse models like Lasso)
- ✅ Feature importance via model coefficients and SHAP values
- ✅ A final comparison table summarizing performance
- ✅ Visualizations of top predictors across models
Filtering of Data¶
# ===============================
# 📚 IMPORT LIBRARIES
# ===============================
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from category_encoders import TargetEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# ===============================
# 📁 LOAD DATASET
# ===============================
# Define the project directory and path to the cleaned dataset
project_dir = os.path.dirname(os.getcwd())
data_path = os.path.join(project_dir, "cleandata", "combined_personas_sample.csv")
# Read the CSV into a pandas DataFrame
df = pd.read_csv(
data_path,
low_memory=False, # Prevents dtype inference warnings
na_values=["NaN", "NA", "N/A", "", " "], # Define common NA representations
dtype={"UPM": str} # Ensure UPM is treated as a string
)
# Avoid scientific notation in pandas output
pd.set_option('display.float_format', '{:.4f}'.format)
# Display basic info about the dataset (number of rows, columns, dtypes, etc.)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 72000 entries, 0 to 71999 Data columns (total 66 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ANIO 72000 non-null int64 1 ID_PERSONA 72000 non-null object 2 LLAVE_ENTIDAD 72000 non-null int64 3 LLAVE_MUNICIPIO 72000 non-null int64 4 CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM) 72000 non-null int64 5 LLAVE_LOCALIDAD 72000 non-null int64 6 CLAVE_LOCALIDAD_INEGI 21039 non-null float64 7 ID_VIVIENDA 72000 non-null int64 8 ID_HOGAR 16000 non-null float64 9 LLAVE_COBERTURA 72000 non-null int64 10 LLAVE_CLASEVIVIENDA 72000 non-null int64 11 LLAVE_SEXO 72000 non-null int64 12 LLAVE_PARENTESCO 72000 non-null int64 13 LLAVE_IDENTMADRE 72000 non-null int64 14 LLAVE_IDENTPADRE 72000 non-null int64 15 LLAVE_PAIS_NAC 72000 non-null int64 16 LLAVE_ENTIDAD_NAC 72000 non-null int64 17 LLAVE_NACIONALIDAD 72000 non-null int64 18 LLAVE_SERSALUD 72000 non-null int64 19 LLAVE_AFRODES 72000 non-null int64 20 LLAVE_REGISNAC 72000 non-null int64 21 LLAVE_RELIGION 72000 non-null int64 22 LLAVE_HLENGUA 72000 non-null int64 23 LLAVE_LENGUAMAT 72000 non-null int64 24 LLAVE_HESPANOL 72000 non-null int64 25 LLAVE_ELENGUA 72000 non-null int64 26 LLAVE_PERTEINDIGENA 72000 non-null int64 27 LLAVE_ASISESCOLAR 72000 non-null int64 28 LLAVE_PAIS_ASISESCOLAR 72000 non-null int64 29 LLAVE_ENTIDAD_ASISESCOLAR 72000 non-null int64 30 LLAVE_MUNICIPIO_ASISESCOLAR 72000 non-null int64 31 LLAVE_TIETRASLADO_ESCOLAR 72000 non-null int64 32 LLAVE_MEDTRASLADO_ESCOLAR 72000 non-null int64 33 LLAVE_NIVACAD 72000 non-null int64 34 LLAVE_CARRERA 72000 non-null int64 35 LLAVE_ALFABETISMO 72000 non-null int64 36 LLAVE_PAIS_RES5A 72000 non-null int64 37 LLAVE_ENTIDAD_RES5A 72000 non-null int64 38 LLAVE_MUNICIPIO_RES5A 72000 non-null int64 39 LLAVE_CAUSAMIGRACION 72000 non-null int64 40 LLAVE_SITUACONYUGAL 72000 non-null int64 41 LLAVE_IDENTPAREJA 72000 non-null int64 42 LLAVE_ACTPRIMARIA 72000 non-null int64 43 LLAVE_OCUPACION 72000 non-null int64 44 LLAVE_SITTRA 72000 non-null int64 45 LLAVE_ACTECONOMICA 72000 non-null int64 46 ACTIVIDAD_ECONOMICA_INEGI 24624 non-null float64 47 LLAVE_PAIS_TRABAJO 72000 non-null int64 48 LLAVE_ENTIDAD_TRABAJO 72000 non-null int64 49 LLAVE_MUNICIPIO_TRABAJO 72000 non-null int64 50 LLAVE_TIETRASLADO_TRABAJO 72000 non-null int64 51 LLAVE_MEDTRASLADO_TRABAJO 72000 non-null int64 52 LLAVE_TAMLOC 72000 non-null int64 53 ESTRATO 72000 non-null object 54 UPM 72000 non-null object 55 FACTOR_EXP 72000 non-null int64 56 NUMPERSONA 56000 non-null float64 57 EDAD 71929 non-null float64 58 ESCOLARIDAD 66330 non-null float64 59 ESCOLARIDAD_ACUMULADA 66142 non-null float64 60 INGRESO 72000 non-null int64 61 HORAS_TRABAJADAS 16450 non-null float64 62 HIJOS_NACIDOS 25133 non-null float64 63 HIJOS_VIVOS 18525 non-null float64 64 HIJOS_FALLECIDOS 18525 non-null float64 65 MERCADO_TRABAJO_LOCAL 72000 non-null int64 dtypes: float64(11), int64(52), object(3) memory usage: 36.3+ MB
# ===============================
# 🧹 VARIABLE TYPE ASSIGNMENT
# ===============================
# 1. 📈 Define continuous (numerical) variables manually
continuous_vars = [
'INGRESO', 'HORAS_TRABAJADAS', 'ESCOLARIDAD_ACUMULADA', 'ESCOLARIDAD', 'EDAD',
'HIJOS_NACIDOS', 'HIJOS_VIVOS', 'HIJOS_FALLECIDOS'
]
# 2. 🏷️ Define categorical variables manually
categorical_vars = [
'ANIO', 'LLAVE_SEXO', 'LLAVE_PARENTESCO', 'LLAVE_IDENTMADRE', 'LLAVE_IDENTPADRE',
'LLAVE_PAIS_NAC', 'LLAVE_ENTIDAD_NAC', 'LLAVE_NACIONALIDAD', 'LLAVE_SERSALUD',
'LLAVE_AFRODES', 'LLAVE_REGISNAC', 'LLAVE_RELIGION', 'LLAVE_HLENGUA',
'LLAVE_LENGUAMAT', 'LLAVE_HESPANOL', 'LLAVE_ELENGUA', 'LLAVE_PERTEINDIGENA',
'LLAVE_ASISESCOLAR', 'LLAVE_PAIS_ASISESCOLAR', 'LLAVE_ENTIDAD_ASISESCOLAR',
'LLAVE_MUNICIPIO_ASISESCOLAR', 'LLAVE_TIETRASLADO_ESCOLAR',
'LLAVE_MEDTRASLADO_ESCOLAR', 'LLAVE_NIVACAD', 'LLAVE_CARRERA',
'LLAVE_ALFABETISMO', 'LLAVE_PAIS_RES5A', 'LLAVE_ENTIDAD_RES5A',
'LLAVE_MUNICIPIO_RES5A', 'LLAVE_CAUSAMIGRACION', 'LLAVE_SITUACONYUGAL',
'LLAVE_IDENTPAREJA', 'LLAVE_ACTPRIMARIA', 'LLAVE_OCUPACION', 'LLAVE_SITTRA',
'LLAVE_ACTECONOMICA', 'ACTIVIDAD_ECONOMICA_INEGI', 'LLAVE_PAIS_TRABAJO',
'LLAVE_ENTIDAD_TRABAJO', 'LLAVE_MUNICIPIO_TRABAJO',
'LLAVE_TIETRASLADO_TRABAJO', 'LLAVE_MEDTRASLADO_TRABAJO',
'LLAVE_TAMLOC', 'ESTRATO', 'UPM', 'MERCADO_TRABAJO_LOCAL'
]
# ===============================
# 🔄 VARIABLE TYPE CONVERSION
# ===============================
# Convert selected columns to categorical type
for col in categorical_vars:
if col in df.columns:
df[col] = df[col].astype('category')
# Convert selected columns to numeric (float), coercing errors
for col in continuous_vars:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors='coerce')
# ✅ Display updated structure of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 72000 entries, 0 to 71999 Data columns (total 66 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ANIO 72000 non-null category 1 ID_PERSONA 72000 non-null object 2 LLAVE_ENTIDAD 72000 non-null int64 3 LLAVE_MUNICIPIO 72000 non-null int64 4 CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM) 72000 non-null int64 5 LLAVE_LOCALIDAD 72000 non-null int64 6 CLAVE_LOCALIDAD_INEGI 21039 non-null float64 7 ID_VIVIENDA 72000 non-null int64 8 ID_HOGAR 16000 non-null float64 9 LLAVE_COBERTURA 72000 non-null int64 10 LLAVE_CLASEVIVIENDA 72000 non-null int64 11 LLAVE_SEXO 72000 non-null category 12 LLAVE_PARENTESCO 72000 non-null category 13 LLAVE_IDENTMADRE 72000 non-null category 14 LLAVE_IDENTPADRE 72000 non-null category 15 LLAVE_PAIS_NAC 72000 non-null category 16 LLAVE_ENTIDAD_NAC 72000 non-null category 17 LLAVE_NACIONALIDAD 72000 non-null category 18 LLAVE_SERSALUD 72000 non-null category 19 LLAVE_AFRODES 72000 non-null category 20 LLAVE_REGISNAC 72000 non-null category 21 LLAVE_RELIGION 72000 non-null category 22 LLAVE_HLENGUA 72000 non-null category 23 LLAVE_LENGUAMAT 72000 non-null category 24 LLAVE_HESPANOL 72000 non-null category 25 LLAVE_ELENGUA 72000 non-null category 26 LLAVE_PERTEINDIGENA 72000 non-null category 27 LLAVE_ASISESCOLAR 72000 non-null category 28 LLAVE_PAIS_ASISESCOLAR 72000 non-null category 29 LLAVE_ENTIDAD_ASISESCOLAR 72000 non-null category 30 LLAVE_MUNICIPIO_ASISESCOLAR 72000 non-null category 31 LLAVE_TIETRASLADO_ESCOLAR 72000 non-null category 32 LLAVE_MEDTRASLADO_ESCOLAR 72000 non-null category 33 LLAVE_NIVACAD 72000 non-null category 34 LLAVE_CARRERA 72000 non-null category 35 LLAVE_ALFABETISMO 72000 non-null category 36 LLAVE_PAIS_RES5A 72000 non-null category 37 LLAVE_ENTIDAD_RES5A 72000 non-null category 38 LLAVE_MUNICIPIO_RES5A 72000 non-null category 39 LLAVE_CAUSAMIGRACION 72000 non-null category 40 LLAVE_SITUACONYUGAL 72000 non-null category 41 LLAVE_IDENTPAREJA 72000 non-null category 42 LLAVE_ACTPRIMARIA 72000 non-null category 43 LLAVE_OCUPACION 72000 non-null category 44 LLAVE_SITTRA 72000 non-null category 45 LLAVE_ACTECONOMICA 72000 non-null category 46 ACTIVIDAD_ECONOMICA_INEGI 24624 non-null category 47 LLAVE_PAIS_TRABAJO 72000 non-null category 48 LLAVE_ENTIDAD_TRABAJO 72000 non-null category 49 LLAVE_MUNICIPIO_TRABAJO 72000 non-null category 50 LLAVE_TIETRASLADO_TRABAJO 72000 non-null category 51 LLAVE_MEDTRASLADO_TRABAJO 72000 non-null category 52 LLAVE_TAMLOC 72000 non-null category 53 ESTRATO 72000 non-null category 54 UPM 72000 non-null category 55 FACTOR_EXP 72000 non-null int64 56 NUMPERSONA 56000 non-null float64 57 EDAD 71929 non-null float64 58 ESCOLARIDAD 66330 non-null float64 59 ESCOLARIDAD_ACUMULADA 66142 non-null float64 60 INGRESO 72000 non-null int64 61 HORAS_TRABAJADAS 16450 non-null float64 62 HIJOS_NACIDOS 25133 non-null float64 63 HIJOS_VIVOS 18525 non-null float64 64 HIJOS_FALLECIDOS 18525 non-null float64 65 MERCADO_TRABAJO_LOCAL 72000 non-null category dtypes: category(46), float64(10), int64(9), object(1) memory usage: 17.0+ MB
# ===============================
# 📊 NULL VALUE INSPECTION
# ===============================
# 1. Calculate the percentage of null values per column
null_percentage = np.round(100 - df.notnull().mean() * 100)
# 2. Create a DataFrame to organize results
null_summary = pd.DataFrame(null_percentage, columns=["Percentage of Null Values"])
# 3. Display the full summary
print(null_summary.to_string())
Percentage of Null Values ANIO 0.0000 ID_PERSONA 0.0000 LLAVE_ENTIDAD 0.0000 LLAVE_MUNICIPIO 0.0000 CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM) 0.0000 LLAVE_LOCALIDAD 0.0000 CLAVE_LOCALIDAD_INEGI 71.0000 ID_VIVIENDA 0.0000 ID_HOGAR 78.0000 LLAVE_COBERTURA 0.0000 LLAVE_CLASEVIVIENDA 0.0000 LLAVE_SEXO 0.0000 LLAVE_PARENTESCO 0.0000 LLAVE_IDENTMADRE 0.0000 LLAVE_IDENTPADRE 0.0000 LLAVE_PAIS_NAC 0.0000 LLAVE_ENTIDAD_NAC 0.0000 LLAVE_NACIONALIDAD 0.0000 LLAVE_SERSALUD 0.0000 LLAVE_AFRODES 0.0000 LLAVE_REGISNAC 0.0000 LLAVE_RELIGION 0.0000 LLAVE_HLENGUA 0.0000 LLAVE_LENGUAMAT 0.0000 LLAVE_HESPANOL 0.0000 LLAVE_ELENGUA 0.0000 LLAVE_PERTEINDIGENA 0.0000 LLAVE_ASISESCOLAR 0.0000 LLAVE_PAIS_ASISESCOLAR 0.0000 LLAVE_ENTIDAD_ASISESCOLAR 0.0000 LLAVE_MUNICIPIO_ASISESCOLAR 0.0000 LLAVE_TIETRASLADO_ESCOLAR 0.0000 LLAVE_MEDTRASLADO_ESCOLAR 0.0000 LLAVE_NIVACAD 0.0000 LLAVE_CARRERA 0.0000 LLAVE_ALFABETISMO 0.0000 LLAVE_PAIS_RES5A 0.0000 LLAVE_ENTIDAD_RES5A 0.0000 LLAVE_MUNICIPIO_RES5A 0.0000 LLAVE_CAUSAMIGRACION 0.0000 LLAVE_SITUACONYUGAL 0.0000 LLAVE_IDENTPAREJA 0.0000 LLAVE_ACTPRIMARIA 0.0000 LLAVE_OCUPACION 0.0000 LLAVE_SITTRA 0.0000 LLAVE_ACTECONOMICA 0.0000 ACTIVIDAD_ECONOMICA_INEGI 66.0000 LLAVE_PAIS_TRABAJO 0.0000 LLAVE_ENTIDAD_TRABAJO 0.0000 LLAVE_MUNICIPIO_TRABAJO 0.0000 LLAVE_TIETRASLADO_TRABAJO 0.0000 LLAVE_MEDTRASLADO_TRABAJO 0.0000 LLAVE_TAMLOC 0.0000 ESTRATO 0.0000 UPM 0.0000 FACTOR_EXP 0.0000 NUMPERSONA 22.0000 EDAD 0.0000 ESCOLARIDAD 8.0000 ESCOLARIDAD_ACUMULADA 8.0000 INGRESO 0.0000 HORAS_TRABAJADAS 77.0000 HIJOS_NACIDOS 65.0000 HIJOS_VIVOS 74.0000 HIJOS_FALLECIDOS 74.0000 MERCADO_TRABAJO_LOCAL 0.0000
🧹 Handling Missing Data
To ensure the quality of our regression models, we remove variables with a high percentage of missing values. However, since education is a central variable in our analysis, we adjust the threshold for filtering.
⚙️ Filtering Rule:
- Remove columns with more than 10% missing values.
- This threshold is a relaxed version of the more conservative 5% typically used.
- The goal is to preserve key variables (especially education-related ones) while still reducing noise and data sparsity.
🔍 Why this matters:
Removing columns with excessive missing data helps prevent:
- Model bias from uneven imputation
- Overfitting due to sparse feature space
- Unstable coefficient estimates
We later apply imputation strategies (e.g., median for numeric values and most frequent for categorical variables) during preprocessing to handle the remaining missing values.
# ✅ Filter columns based on missing value threshold
missing_percentage = df.isnull().mean() * 100
columns_to_keep = missing_percentage[missing_percentage <= 10].index
df_filtered = df[columns_to_keep]
# 📊 Recalculate the null percentages after filtering
null_percentage = np.round(100 - df_filtered.notnull().mean() * 100)
non_null_percentage_df = pd.DataFrame(null_percentage, columns=['Percentage of Null Values'])
print(non_null_percentage_df)
# Count the number of remaining rows
remaining_cols = df_filtered.shape[1]
# Print the number of remaining rows
print("Number of remaining columns after removing those with >10% missing values:", remaining_cols)
Percentage of Null Values ANIO 0.0000 ID_PERSONA 0.0000 LLAVE_ENTIDAD 0.0000 LLAVE_MUNICIPIO 0.0000 CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM) 0.0000 LLAVE_LOCALIDAD 0.0000 ID_VIVIENDA 0.0000 LLAVE_COBERTURA 0.0000 LLAVE_CLASEVIVIENDA 0.0000 LLAVE_SEXO 0.0000 LLAVE_PARENTESCO 0.0000 LLAVE_IDENTMADRE 0.0000 LLAVE_IDENTPADRE 0.0000 LLAVE_PAIS_NAC 0.0000 LLAVE_ENTIDAD_NAC 0.0000 LLAVE_NACIONALIDAD 0.0000 LLAVE_SERSALUD 0.0000 LLAVE_AFRODES 0.0000 LLAVE_REGISNAC 0.0000 LLAVE_RELIGION 0.0000 LLAVE_HLENGUA 0.0000 LLAVE_LENGUAMAT 0.0000 LLAVE_HESPANOL 0.0000 LLAVE_ELENGUA 0.0000 LLAVE_PERTEINDIGENA 0.0000 LLAVE_ASISESCOLAR 0.0000 LLAVE_PAIS_ASISESCOLAR 0.0000 LLAVE_ENTIDAD_ASISESCOLAR 0.0000 LLAVE_MUNICIPIO_ASISESCOLAR 0.0000 LLAVE_TIETRASLADO_ESCOLAR 0.0000 LLAVE_MEDTRASLADO_ESCOLAR 0.0000 LLAVE_NIVACAD 0.0000 LLAVE_CARRERA 0.0000 LLAVE_ALFABETISMO 0.0000 LLAVE_PAIS_RES5A 0.0000 LLAVE_ENTIDAD_RES5A 0.0000 LLAVE_MUNICIPIO_RES5A 0.0000 LLAVE_CAUSAMIGRACION 0.0000 LLAVE_SITUACONYUGAL 0.0000 LLAVE_IDENTPAREJA 0.0000 LLAVE_ACTPRIMARIA 0.0000 LLAVE_OCUPACION 0.0000 LLAVE_SITTRA 0.0000 LLAVE_ACTECONOMICA 0.0000 LLAVE_PAIS_TRABAJO 0.0000 LLAVE_ENTIDAD_TRABAJO 0.0000 LLAVE_MUNICIPIO_TRABAJO 0.0000 LLAVE_TIETRASLADO_TRABAJO 0.0000 LLAVE_MEDTRASLADO_TRABAJO 0.0000 LLAVE_TAMLOC 0.0000 ESTRATO 0.0000 UPM 0.0000 FACTOR_EXP 0.0000 EDAD 0.0000 ESCOLARIDAD 8.0000 ESCOLARIDAD_ACUMULADA 8.0000 INGRESO 0.0000 MERCADO_TRABAJO_LOCAL 0.0000 Number of remaining columns after removing those with >10% missing values: 58
After filtering out columns with too much missingness, we proceed to remove rows that still contain missing values in the remaining features.
# 🧹 Drop all rows with any remaining missing values
df_cleaned = df_filtered.dropna()
# 📏 Show how much data is left
remaining_rows = df_cleaned.shape[0]
print("Number of remaining rows after removing rows with missing values:", remaining_rows)
Number of remaining rows after removing rows with missing values: 65861
📊 Overview of the Cleaned Dataset
Once missing values are handled, we explore the structure and distribution of the remaining variables using df.info() and df.describe().
These steps give us a first idea of:
- Which variables are numeric or categorical
- Their basic distribution: means, min/max, standard deviations
- Whether further preprocessing (scaling, encoding) is needed before modeling
# Display information about the remaining columns
df_cleaned.info()
# Display descriptive statistics for the remaining columns
df_cleaned.describe()
<class 'pandas.core.frame.DataFrame'> Int64Index: 65861 entries, 0 to 71999 Data columns (total 58 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ANIO 65861 non-null category 1 ID_PERSONA 65861 non-null object 2 LLAVE_ENTIDAD 65861 non-null int64 3 LLAVE_MUNICIPIO 65861 non-null int64 4 CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM) 65861 non-null int64 5 LLAVE_LOCALIDAD 65861 non-null int64 6 ID_VIVIENDA 65861 non-null int64 7 LLAVE_COBERTURA 65861 non-null int64 8 LLAVE_CLASEVIVIENDA 65861 non-null int64 9 LLAVE_SEXO 65861 non-null category 10 LLAVE_PARENTESCO 65861 non-null category 11 LLAVE_IDENTMADRE 65861 non-null category 12 LLAVE_IDENTPADRE 65861 non-null category 13 LLAVE_PAIS_NAC 65861 non-null category 14 LLAVE_ENTIDAD_NAC 65861 non-null category 15 LLAVE_NACIONALIDAD 65861 non-null category 16 LLAVE_SERSALUD 65861 non-null category 17 LLAVE_AFRODES 65861 non-null category 18 LLAVE_REGISNAC 65861 non-null category 19 LLAVE_RELIGION 65861 non-null category 20 LLAVE_HLENGUA 65861 non-null category 21 LLAVE_LENGUAMAT 65861 non-null category 22 LLAVE_HESPANOL 65861 non-null category 23 LLAVE_ELENGUA 65861 non-null category 24 LLAVE_PERTEINDIGENA 65861 non-null category 25 LLAVE_ASISESCOLAR 65861 non-null category 26 LLAVE_PAIS_ASISESCOLAR 65861 non-null category 27 LLAVE_ENTIDAD_ASISESCOLAR 65861 non-null category 28 LLAVE_MUNICIPIO_ASISESCOLAR 65861 non-null category 29 LLAVE_TIETRASLADO_ESCOLAR 65861 non-null category 30 LLAVE_MEDTRASLADO_ESCOLAR 65861 non-null category 31 LLAVE_NIVACAD 65861 non-null category 32 LLAVE_CARRERA 65861 non-null category 33 LLAVE_ALFABETISMO 65861 non-null category 34 LLAVE_PAIS_RES5A 65861 non-null category 35 LLAVE_ENTIDAD_RES5A 65861 non-null category 36 LLAVE_MUNICIPIO_RES5A 65861 non-null category 37 LLAVE_CAUSAMIGRACION 65861 non-null category 38 LLAVE_SITUACONYUGAL 65861 non-null category 39 LLAVE_IDENTPAREJA 65861 non-null category 40 LLAVE_ACTPRIMARIA 65861 non-null category 41 LLAVE_OCUPACION 65861 non-null category 42 LLAVE_SITTRA 65861 non-null category 43 LLAVE_ACTECONOMICA 65861 non-null category 44 LLAVE_PAIS_TRABAJO 65861 non-null category 45 LLAVE_ENTIDAD_TRABAJO 65861 non-null category 46 LLAVE_MUNICIPIO_TRABAJO 65861 non-null category 47 LLAVE_TIETRASLADO_TRABAJO 65861 non-null category 48 LLAVE_MEDTRASLADO_TRABAJO 65861 non-null category 49 LLAVE_TAMLOC 65861 non-null category 50 ESTRATO 65861 non-null category 51 UPM 65861 non-null category 52 FACTOR_EXP 65861 non-null int64 53 EDAD 65861 non-null float64 54 ESCOLARIDAD 65861 non-null float64 55 ESCOLARIDAD_ACUMULADA 65861 non-null float64 56 INGRESO 65861 non-null int64 57 MERCADO_TRABAJO_LOCAL 65861 non-null category dtypes: category(45), float64(3), int64(9), object(1) memory usage: 12.6+ MB
| LLAVE_ENTIDAD | LLAVE_MUNICIPIO | CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM) | LLAVE_LOCALIDAD | ID_VIVIENDA | LLAVE_COBERTURA | LLAVE_CLASEVIVIENDA | FACTOR_EXP | EDAD | ESCOLARIDAD | ESCOLARIDAD_ACUMULADA | INGRESO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 | 65861.0000 |
| mean | 17.7284 | 1077.0988 | 76.1650 | 2610375.9797 | 2012058901208914.5000 | 1.0303 | 0.6500 | 7.9047 | 30.8076 | 2.9827 | 6.3126 | 7632.9772 |
| std | 7.8855 | 759.5165 | 99.5664 | 5774932.5622 | 7012302161285.0986 | 0.9433 | 1.0904 | 12.5169 | 20.4113 | 1.8101 | 4.6697 | 779328.3591 |
| min | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 2000000000000053.0000 | 0.0000 | 0.0000 | 1.0000 | 3.0000 | 0.0000 | 0.0000 | 0.0000 |
| 25% | 12.0000 | 403.0000 | 17.0000 | 0.0000 | 2010000000581423.0000 | 0.0000 | 0.0000 | 2.0000 | 14.0000 | 2.0000 | 2.0000 | 0.0000 |
| 50% | 16.0000 | 894.0000 | 42.0000 | 0.0000 | 2015110140003917.0000 | 1.0000 | 0.0000 | 4.0000 | 27.0000 | 3.0000 | 6.0000 | 0.0000 |
| 75% | 23.0000 | 1804.0000 | 95.0000 | 280139.0000 | 2015301900000550.0000 | 2.0000 | 1.0000 | 9.0000 | 44.0000 | 4.0000 | 9.0000 | 1500.0000 |
| max | 32.0000 | 2469.0000 | 570.0000 | 24550042.0000 | 2020320580000274.0000 | 2.0000 | 10.0000 | 646.0000 | 110.0000 | 9.0000 | 24.0000 | 99999999.0000 |
👷♀️ Restricting the Analysis to Adults Likely to Work Full-Time
Although the legal working age in Mexico begins at 15 years old, this analysis focuses on individuals aged 18 and older. The reason for this is that people under 18 are less likely to:
- Participate in the labor market on a full-time basis
- Have completed their basic education
- Earn stable or comparable income across time
# Filter individuals aged 18 or older
df_cleaned_w_age = df_cleaned[df_cleaned['EDAD'] >= 18].copy()
# Optional: check the shape
print(f"✅ Working-age dataset shape: {df_cleaned_w_age.shape}")
✅ Working-age dataset shape: (43943, 58)
💰 Transforming the Income Variable
The original income variable (INGRESO) shows a high degree of skewness and dispersion, with a long right tail of high earners. This can distort the performance and interpretation of regression models, especially those assuming normally distributed errors.
To address this, we apply a logarithmic transformation:
df_cleaned_w_age['LOG_INGRESO'] = np.log1p(df_cleaned_w_age['INGRESO']) # computes ln(1 + ingreso)
This transformation has several benefits:
- 📉 Reduces the impact of extreme income values (outliers)
- 🧮 Brings the distribution closer to normality
- ✅ Makes linear relationships easier to model
We visualize the effect below:
🔍 Distribution Before and After Transformation
- Left: Original income values
- Right: Log-transformed income values (ln(1 + INGRESO))
# Set style
sns.set(style="whitegrid")
# Create subplots
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
# Original Income
sns.histplot(df_cleaned_w_age['INGRESO'], bins=50, ax=axs[0], kde=True)
axs[0].set_title("Original Income Distribution")
axs[0].set_xlabel("INGRESO")
# Log Income
sns.histplot(df_cleaned_w_age['LOG_INGRESO'], bins=50, ax=axs[1], kde=True, color='orange')
axs[1].set_title("Log(1 + INGRESO) Distribution")
axs[1].set_xlabel("LOG_INGRESO")
plt.tight_layout()
plt.show()
🧹 Filtering to Focus on Income Earners
After filtering to the working-age population, we further restrict our dataset to individuals with positive income.
This step ensures that our analysis reflects the income dynamics only among people who are earning. Including individuals with zero income (e.g., students, unemployed, or informal workers without reported earnings) would introduce biases and reduce the explanatory power of education-related variables.
Zero-income individuals often belong to structurally different subgroups:
- In school/training
- Outside the labor force
- Working informally and underreported
- Not seeking employment
These groups could be influenced by different factors than those determining wage levels.
We then compute the natural logarithm of income using log1p() to stabilize variance and normalize the distribution.
# Filter first
df_positive_income = df_cleaned_w_age[df_cleaned_w_age['INGRESO'] > 0].copy()
# Then compute the log
df_positive_income['LOG_INGRESO'] = np.log1p(df_positive_income['INGRESO'])
# 🖨️ Optional: Check result
print(df_positive_income[['INGRESO', 'LOG_INGRESO']].describe())
INGRESO LOG_INGRESO count 18890.0000 18890.0000 mean 26477.1049 8.1559 std 1455029.4042 0.9159 min 2.0000 1.0986 25% 2143.0000 7.6704 50% 3857.0000 8.2579 75% 6000.0000 8.6997 max 99999999.0000 18.4207
🔍 Income Distribution Before and After Log Transformation
To address the strong right-skew in the income variable, we apply a log transformation to INGRESO and examine the difference in distribution:
- The left plot shows the original income distribution, where a majority of observations are concentrated near zero, with a long tail extending to very high values.
- The right plot displays the transformed distribution using log(1 + INGRESO). This transformation compresses extreme values and spreads out lower ones, making the data more symmetric and suitable for regression modeling.
We continue the analysis using the log-transformed income as the target variable: LOG_INGRESO.
# Set style
sns.set(style="whitegrid")
# Create subplots
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
# Original Income
sns.histplot(df_positive_income['INGRESO'], bins=50, ax=axs[0], kde=True)
axs[0].set_title("Original Income Distribution")
axs[0].set_xlabel("INGRESO")
# Log Income
sns.histplot(df_positive_income['LOG_INGRESO'], bins=50, ax=axs[1], kde=True, color='orange')
axs[1].set_title("Log(1 + INGRESO) Distribution")
axs[1].set_xlabel("LOG_INGRESO")
plt.tight_layout()
plt.show()
Train-Test Split¶
To evaluate model performance fairly, we divide our dataset into training and testing subsets. This ensures that we train our models on one portion of the data and test their predictive ability on unseen examples. • Training Set (90%): Used to train the model • Testing Set (10%): Held out for evaluating out-of-sample performance
# Define the features (X) and target (y)
X = df_positive_income.drop(columns=['LOG_INGRESO'])
y = df_positive_income['LOG_INGRESO']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Print the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
X_train shape: (17001, 58) X_test shape: (1889, 58) y_train shape: (17001,) y_test shape: (1889,)
Data Exploration¶
🔍 Data Exploration
Before modeling, we perform an initial exploration of the training dataset. This helps us understand:
- What types of variables we have (categorical vs. numerical)
- How variables like education, age, and income behave
- Whether any transformations or encodings will be needed
We begin by combining the training features and target variable into one dataset for easier inspection.
# Step 1: Concatenate X_train and y_train
train_data = pd.concat([X_train, y_train], axis=1)
train_data.columns
Index(['ANIO', 'ID_PERSONA', 'LLAVE_ENTIDAD', 'LLAVE_MUNICIPIO',
'CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM)', 'LLAVE_LOCALIDAD',
'ID_VIVIENDA', 'LLAVE_COBERTURA', 'LLAVE_CLASEVIVIENDA', 'LLAVE_SEXO',
'LLAVE_PARENTESCO', 'LLAVE_IDENTMADRE', 'LLAVE_IDENTPADRE',
'LLAVE_PAIS_NAC', 'LLAVE_ENTIDAD_NAC', 'LLAVE_NACIONALIDAD',
'LLAVE_SERSALUD', 'LLAVE_AFRODES', 'LLAVE_REGISNAC', 'LLAVE_RELIGION',
'LLAVE_HLENGUA', 'LLAVE_LENGUAMAT', 'LLAVE_HESPANOL', 'LLAVE_ELENGUA',
'LLAVE_PERTEINDIGENA', 'LLAVE_ASISESCOLAR', 'LLAVE_PAIS_ASISESCOLAR',
'LLAVE_ENTIDAD_ASISESCOLAR', 'LLAVE_MUNICIPIO_ASISESCOLAR',
'LLAVE_TIETRASLADO_ESCOLAR', 'LLAVE_MEDTRASLADO_ESCOLAR',
'LLAVE_NIVACAD', 'LLAVE_CARRERA', 'LLAVE_ALFABETISMO',
'LLAVE_PAIS_RES5A', 'LLAVE_ENTIDAD_RES5A', 'LLAVE_MUNICIPIO_RES5A',
'LLAVE_CAUSAMIGRACION', 'LLAVE_SITUACONYUGAL', 'LLAVE_IDENTPAREJA',
'LLAVE_ACTPRIMARIA', 'LLAVE_OCUPACION', 'LLAVE_SITTRA',
'LLAVE_ACTECONOMICA', 'LLAVE_PAIS_TRABAJO', 'LLAVE_ENTIDAD_TRABAJO',
'LLAVE_MUNICIPIO_TRABAJO', 'LLAVE_TIETRASLADO_TRABAJO',
'LLAVE_MEDTRASLADO_TRABAJO', 'LLAVE_TAMLOC', 'ESTRATO', 'UPM',
'FACTOR_EXP', 'EDAD', 'ESCOLARIDAD', 'ESCOLARIDAD_ACUMULADA', 'INGRESO',
'MERCADO_TRABAJO_LOCAL', 'LOG_INGRESO'],
dtype='object')
To prepare the data for modeling and visualization, we classify the variables into continuous and categorical types. This distinction will guide our preprocessing and help tailor our feature engineering steps later on.
We also enforce appropriate data types to ensure compatibility with scikit-learn pipelines and other libraries.
# 1) Setup
# Set style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (15, 5)
# 2) Define continuous variables
continuous_vars = [
"EDAD",
"ESCOLARIDAD",
"ESCOLARIDAD_ACUMULADA",
"FACTOR_EXP",
"INGRESO",
"LOG_INGRESO"
]
# 3) Define categorical variables as the remaining columns
all_columns = train_data.columns.tolist()
categorical_vars = list(set(all_columns) - set(continuous_vars))
# 4) Force types
#train data
train_data[continuous_vars] = train_data[continuous_vars].astype(float)
train_data[categorical_vars] = train_data[categorical_vars].astype("category")
#test data
train_data[continuous_vars] = train_data[continuous_vars].astype(float)
train_data[categorical_vars] = train_data[categorical_vars].astype("category")
# 5) Done!
print("✅ Continuous Variables:", continuous_vars)
print("\n✅ Categorical Variables:", categorical_vars)
✅ Continuous Variables: ['EDAD', 'ESCOLARIDAD', 'ESCOLARIDAD_ACUMULADA', 'FACTOR_EXP', 'INGRESO', 'LOG_INGRESO'] ✅ Categorical Variables: ['LLAVE_TIETRASLADO_TRABAJO', 'ID_VIVIENDA', 'LLAVE_PERTEINDIGENA', 'LLAVE_REGISNAC', 'LLAVE_ENTIDAD_RES5A', 'LLAVE_MUNICIPIO_TRABAJO', 'LLAVE_RELIGION', 'LLAVE_MEDTRASLADO_ESCOLAR', 'MERCADO_TRABAJO_LOCAL', 'LLAVE_SEXO', 'LLAVE_CARRERA', 'LLAVE_ENTIDAD_ASISESCOLAR', 'LLAVE_PAIS_ASISESCOLAR', 'LLAVE_MUNICIPIO_ASISESCOLAR', 'LLAVE_ENTIDAD_NAC', 'LLAVE_IDENTPAREJA', 'LLAVE_MUNICIPIO', 'LLAVE_NIVACAD', 'LLAVE_ACTECONOMICA', 'ANIO', 'LLAVE_COBERTURA', 'LLAVE_SERSALUD', 'LLAVE_PAIS_NAC', 'ESTRATO', 'LLAVE_NACIONALIDAD', 'LLAVE_CAUSAMIGRACION', 'LLAVE_HESPANOL', 'LLAVE_ALFABETISMO', 'UPM', 'LLAVE_ACTPRIMARIA', 'LLAVE_ELENGUA', 'LLAVE_SITTRA', 'LLAVE_AFRODES', 'LLAVE_TIETRASLADO_ESCOLAR', 'LLAVE_CLASEVIVIENDA', 'LLAVE_PAIS_RES5A', 'LLAVE_MUNICIPIO_RES5A', 'CLAVE_MUNICIPIO_INEGI(CLAVE_DE_AGEM)', 'LLAVE_IDENTMADRE', 'LLAVE_TAMLOC', 'LLAVE_LENGUAMAT', 'LLAVE_SITUACONYUGAL', 'LLAVE_ENTIDAD_TRABAJO', 'LLAVE_LOCALIDAD', 'LLAVE_PAIS_TRABAJO', 'LLAVE_IDENTPADRE', 'LLAVE_ASISESCOLAR', 'LLAVE_OCUPACION', 'LLAVE_HLENGUA', 'ID_PERSONA', 'LLAVE_MEDTRASLADO_TRABAJO', 'LLAVE_ENTIDAD', 'LLAVE_PARENTESCO']
📊 Visualizing Continuous Variables
To better understand the distribution and relationships between numeric features, we plot:
- Histograms with KDE curves for each variable
- A pairplot to observe potential correlations and non-linear patterns
# 3) Plot Continuous Variables (Histograms)
n_cols = 3
n_rows = -(-len(continuous_vars) // n_cols) # Ceiling division
fig, axs = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
axs = axs.flatten()
for i, col in enumerate(continuous_vars):
sns.histplot(data=train_data, x=col, ax=axs[i], kde=True, bins=30)
axs[i].set_title(f"Distribution of {col}")
axs[i].tick_params(axis='x', rotation=45)
# Remove unused subplots
for j in range(i + 1, len(axs)):
fig.delaxes(axs[j])
plt.tight_layout()
plt.show()
# Plot pairplot to visualize relationships between features
sns.pairplot(train_data[continuous_vars])
plt.suptitle("Pairplot of Features")
plt.show()
To explore linear relationships between the continuous variables in our dataset, we compute the correlation matrix. This helps identify:
- Potential multicollinearity issues
- Strong predictors of income (INGRESO or LOG_INGRESO)
- Redundancy or independence among variables
# Calculate the correlation matrix
correlation_matrix = train_data.corr()
# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of the Data')
plt.show()
/tmp/ipykernel_101637/541619347.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. correlation_matrix = train_data.corr()
📊 Distribution of Categorical Variables
To understand the composition of our dataset, we visualize the most frequent categories across all categorical features. This helps:
- Detect imbalanced variables
- Identify common groups for features like RELIGION, SEXO, ACTPRIMARIA, etc.
- Inform decisions on encoding techniques (e.g., one-hot vs. target encoding)
We plot only the top 10 categories for each variable to keep the charts interpretable.
# 4) 4. Plot Categorical Variables (Bar Plots)
n_cols = 3
n_rows = -(-len(categorical_vars) // n_cols)
fig, axs = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
axs = axs.flatten()
for i, col in enumerate(categorical_vars):
sns.countplot(data=train_data, x=col, ax=axs[i], order=df[col].value_counts().index[:10])
axs[i].set_title(f"Top 10 Categories of {col}")
axs[i].tick_params(axis='x', rotation=45)
# Remove unused subplots
for j in range(i + 1, len(axs)):
fig.delaxes(axs[j])
plt.tight_layout()
plt.show()
Preparing the Data for Machine Learning¶
After the exploratory phase, including the inspection of distributions and correlation structures, we now prepare the data for predictive modeling.
This step involves:
- Handling missing values (already addressed)
- Engineering polynomial and interaction terms
- Scaling and encoding features appropriately
- Building preprocessing pipelines for continuous and categorical variables
This transformation ensures that the data is clean, structured, and model-ready, allowing us to evaluate the importance of education and other variables in predicting income.
🎯 Variable Selection Strategy
From the extensive list of variables retained after the exploratory phase, we apply two key criteria to select the most relevant predictors for modeling:
Theoretical Justification / Literature Review: Variables that are widely recognized in the academic literature as important determinants of income—such as education, age, sex, and occupational characteristics—are prioritized. These variables are expected to capture human capital, demographic effects, and labor market positioning.
Data Quality and Signal Strength: While many variables had acceptable missing data rates (below 10.5%), some had categories overwhelmingly dominated by “Not specified” or similar responses. These variables likely introduce more noise than signal and were therefore excluded.
Additionally:
- In cases of strong multicollinearity (e.g., ESCOLARIDAD and ESCOLARIDAD_ACUMULADA), we retain only the variable that is more interpretable or better suited to our research objective.
- We exclude high-cardinality identifiers (like ID_PERSONA, ID_VIVIENDA) and variables that are not meaningful for income prediction.
✅ The result is a refined set of features that are both substantively relevant and statistically sound, ensuring better model performance and interpretability.
# 🎯 Target Variable
target_variable = ['LOG_INGRESO']
# 🌍 Geographical / Labor Market Context
geographical_context = ['MERCADO_TRABAJO_LOCAL'] # Local Labor Market
# 📅 Temporal Variable
temporal_variable = ['ANIO']
# 🛠️ Work and Commute Characteristics
work_characteristics = [
'LLAVE_TIETRASLADO_TRABAJO', # Commute time to work
'LLAVE_ACTPRIMARIA' # Primary activity
]
# 🎓 Education
education_variables = [
'ESCOLARIDAD_ACUMULADA', # Accumulated schooling
]
# 🧍 Sociodemographic Characteristics
sociodemographic_variables = [
'EDAD', # Age
'LLAVE_SITUACONYUGAL', # Marital status
'LLAVE_PERTEINDIGENA', # Indigenous identity
'LLAVE_RELIGION', # Religion
'LLAVE_AFRODES', # Afrodescendant identity
'LLAVE_SEXO' # Sex
]
# ✅ Combine all selected variables into one list
selected_columns = (
target_variable +
geographical_context +
temporal_variable +
work_characteristics +
education_variables +
sociodemographic_variables
)
# 🎯 Create a new DataFrame with only the selected variables
train_data_selected = train_data[selected_columns].copy()
# 🖨️ Optional: Check shape and preview
print(f"✅ Selected dataset shape: {train_data_selected.shape}")
train_data_selected.head(5)
✅ Selected dataset shape: (17001, 12)
| LOG_INGRESO | MERCADO_TRABAJO_LOCAL | ANIO | LLAVE_TIETRASLADO_TRABAJO | LLAVE_ACTPRIMARIA | ESCOLARIDAD_ACUMULADA | EDAD | LLAVE_SITUACONYUGAL | LLAVE_PERTEINDIGENA | LLAVE_RELIGION | LLAVE_AFRODES | LLAVE_SEXO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24134 | 8.6307 | 640 | 2010 | 0 | 1 | 12.0000 | 43.0000 | 8 | 3 | 1 | 0 | 2 |
| 9692 | 8.2943 | 562 | 2000 | 0 | 1 | 16.0000 | 28.0000 | 9 | 3 | 1 | 0 | 2 |
| 71831 | 8.4154 | 438 | 2020 | 5 | 1 | 0.0000 | 71.0000 | 4 | 1 | 1 | 3 | 1 |
| 35758 | 7.8524 | 65 | 2015 | 1 | 1 | 9.0000 | 29.0000 | 1 | 3 | 0 | 3 | 1 |
| 40856 | 6.7546 | 478 | 2015 | 6 | 1 | 9.0000 | 24.0000 | 1 | 1 | 0 | 3 | 2 |
# Describe the continous variables
train_data_selected.describe()
| LOG_INGRESO | ESCOLARIDAD_ACUMULADA | EDAD | |
|---|---|---|---|
| count | 17001.0000 | 17001.0000 | 17001.0000 |
| mean | 8.1525 | 8.8450 | 38.1140 |
| std | 0.9150 | 4.5599 | 13.4704 |
| min | 1.0986 | 0.0000 | 18.0000 |
| 25% | 7.6704 | 6.0000 | 27.0000 |
| 50% | 8.2430 | 9.0000 | 36.0000 |
| 75% | 8.6997 | 12.0000 | 47.0000 |
| max | 18.4207 | 24.0000 | 98.0000 |
📈 Income Trends by Education Level Across Time
To visually explore how the importance of education in determining income has evolved over time, we compute and plot the average (log) income by education level for different years.
# Show all rows in the terminal output
pd.set_option('display.max_rows', None)
# Group by ANIO and ESCOLARIDAD_ACUMULADA and calculate average log income
avg_income_by_year_edu = train_data_selected.groupby(
['ANIO', 'ESCOLARIDAD_ACUMULADA']
)['LOG_INGRESO'].mean().reset_index()
# Print full output
print(avg_income_by_year_edu)
# Optionally reset the display setting afterward
pd.reset_option('display.max_rows')
ANIO ESCOLARIDAD_ACUMULADA LOG_INGRESO 0 2000 0.0000 6.8698 1 2000 1.0000 6.9846 2 2000 2.0000 7.0178 3 2000 3.0000 7.1629 4 2000 4.0000 7.2527 5 2000 5.0000 7.1382 6 2000 6.0000 7.3952 7 2000 7.0000 7.3890 8 2000 8.0000 7.5516 9 2000 9.0000 7.5846 10 2000 10.0000 7.8494 11 2000 11.0000 7.7391 12 2000 12.0000 7.9563 13 2000 13.0000 8.2580 14 2000 14.0000 8.1859 15 2000 15.0000 8.1893 16 2000 16.0000 8.7380 17 2000 17.0000 8.7135 18 2000 18.0000 9.2569 19 2000 19.0000 9.1968 20 2000 20.0000 9.4468 21 2000 21.0000 9.3016 22 2000 22.0000 8.7484 23 2000 23.0000 NaN 24 2000 24.0000 NaN 25 2010 0.0000 7.4244 26 2010 1.0000 7.7043 27 2010 2.0000 7.4532 28 2010 3.0000 7.6519 29 2010 4.0000 7.8150 30 2010 5.0000 7.8052 31 2010 6.0000 7.9483 32 2010 7.0000 7.9479 33 2010 8.0000 8.1394 34 2010 9.0000 8.0575 35 2010 10.0000 8.1022 36 2010 11.0000 8.3763 37 2010 12.0000 8.3502 38 2010 13.0000 8.5930 39 2010 14.0000 8.5883 40 2010 15.0000 8.6288 41 2010 16.0000 8.9267 42 2010 17.0000 9.1260 43 2010 18.0000 9.1460 44 2010 19.0000 9.3403 45 2010 20.0000 9.7494 46 2010 21.0000 9.9624 47 2010 22.0000 NaN 48 2010 23.0000 NaN 49 2010 24.0000 NaN 50 2015 0.0000 7.6844 51 2015 1.0000 7.7349 52 2015 2.0000 7.8716 53 2015 3.0000 7.9323 54 2015 4.0000 7.8827 55 2015 5.0000 7.9046 56 2015 6.0000 8.0523 57 2015 7.0000 8.1214 58 2015 8.0000 8.2185 59 2015 9.0000 8.2381 60 2015 10.0000 8.3070 61 2015 11.0000 8.2810 62 2015 12.0000 8.3514 63 2015 13.0000 8.4747 64 2015 14.0000 8.6226 65 2015 15.0000 8.6560 66 2015 16.0000 8.9743 67 2015 17.0000 9.0142 68 2015 18.0000 9.1935 69 2015 19.0000 9.2770 70 2015 20.0000 9.1559 71 2015 21.0000 9.5388 72 2015 22.0000 9.4152 73 2015 23.0000 NaN 74 2015 24.0000 7.7191 75 2020 0.0000 7.7593 76 2020 1.0000 7.6609 77 2020 2.0000 7.8807 78 2020 3.0000 7.9827 79 2020 4.0000 8.1074 80 2020 5.0000 8.1547 81 2020 6.0000 8.1983 82 2020 7.0000 8.3857 83 2020 8.0000 8.3601 84 2020 9.0000 8.4159 85 2020 10.0000 8.4142 86 2020 11.0000 8.3832 87 2020 12.0000 8.5121 88 2020 13.0000 8.8755 89 2020 14.0000 8.6012 90 2020 15.0000 8.8726 91 2020 16.0000 9.0056 92 2020 17.0000 9.0986 93 2020 18.0000 9.5325 94 2020 19.0000 8.9598 95 2020 20.0000 9.8716 96 2020 21.0000 9.8099 97 2020 22.0000 9.0172 98 2020 23.0000 9.5570 99 2020 24.0000 NaN
This grouped data is visualized using a line plot, where each line corresponds to a different year.
# Remove NaNs to avoid plot issues
avg_income_by_year_edu_clean = avg_income_by_year_edu.dropna()
plt.figure(figsize=(12, 6))
sns.lineplot(data=avg_income_by_year_edu_clean,
x='ESCOLARIDAD_ACUMULADA',
y='LOG_INGRESO',
hue='ANIO',
palette='tab10')
plt.title("Average Income by Education Level Across Years")
plt.xlabel("Accumulated Education")
plt.ylabel("Average Income")
plt.legend(title="Year")
plt.tight_layout()
plt.show()
To prepare our dataset for modeling, we divide the selected predictors into numerical and categorical features. This is essential for building preprocessing pipelines and ensuring that each variable is treated appropriately during transformation and model fitting.
# ✅ Split into categorical and continuous variable lists
numerical_features = [
'ESCOLARIDAD_ACUMULADA',
'EDAD'
]
categorical_features = [
'MERCADO_TRABAJO_LOCAL',
'LLAVE_TIETRASLADO_TRABAJO',
'LLAVE_ACTPRIMARIA',
'LLAVE_SITUACONYUGAL',
'LLAVE_PERTEINDIGENA',
'LLAVE_RELIGION',
'LLAVE_AFRODES',
'LLAVE_SEXO'
]
The number of categories in each of the variables is:
# Get categorical columns
categorical_features = train_data_selected.select_dtypes(include='category').columns
# Count number of unique categories for each categorical variable
category_counts = {col: train_data_selected[col].nunique() for col in categorical_features}
category_counts_df = pd.DataFrame.from_dict(category_counts, orient='index', columns=['Num Categories'])
# Sort by number of categories (optional)
category_counts_df = category_counts_df.sort_values(by='Num Categories', ascending=False)
# Display
print(category_counts_df)
Num Categories MERCADO_TRABAJO_LOCAL 724 LLAVE_RELIGION 49 LLAVE_SITUACONYUGAL 10 LLAVE_TIETRASLADO_TRABAJO 8 LLAVE_PERTEINDIGENA 5 LLAVE_AFRODES 5 ANIO 4 LLAVE_ACTPRIMARIA 2 LLAVE_SEXO 2
Preprocessing Strategy¶
To prepare the data for modeling, we must transform both numerical and categorical variables in ways that preserve useful information while ensuring compatibility with machine learning algorithms.
⸻
⚙️ Handling Missing Values
One common preprocessing challenge is dealing with missing values. Scikit-Learn provides a convenient SimpleImputer class for this purpose. For numerical variables, we typically fill missing values with the median, which is robust to outliers.
In our case, the variable ESCOLARIDAD_ACUMULADA (accumulated education) has a relatively high missing rate (~10%). Since this variable is central to our research question, we impute missing values using the median from the training set.
⸻
⚙️ Handling Categorical Variables
Categorical variables differ significantly in their number of unique values (cardinality). To encode them effectively:
- Low-cardinality variables (≤ 10 categories): We apply One-Hot Encoding, which creates a binary column for each category. This encoding is interpretable and suitable for linear models.
- High-cardinality variables (≫ 10 categories): We apply Target Encoding (also called Mean Encoding), which replaces each category with the average target value (in this case, income) for that category. This avoids sparse matrices and helps capture more informative patterns.
💡 In our dataset:
- MERCADO_TRABAJO_LOCAL has 774 unique values → encoded with Target Encoding
- All other categorical variables → encoded with One-Hot Encoding
⸻
⚙️ Handling Numerical Variables
Numerical variables are scaled using StandardScaler, which standardizes each feature to have a mean of 0 and a standard deviation of 1. This step is essential for regularized models such as Ridge and Lasso, where the scale of variables directly influences coefficient shrinkage.
⸻
🛠 Combined Preprocessing Pipeline
We use Scikit-Learn’s ColumnTransformer to combine the preprocessing steps into a unified pipeline:
- 📐 Numerical pipeline: SimpleImputer (median) + StandardScaler
- 🔢 Categorical pipeline (low-cardinality): OneHotEncoder
- 🎯 Categorical pipeline (high-cardinality): TargetEncoder
This structure allows us to handle mixed data types cleanly, reduce data leakage, and facilitate reproducibility and model tuning.
# 🔍 Identify high-cardinality categorical features
high_card_cat = ['MERCADO_TRABAJO_LOCAL'] # Contains many unique categories → Target Encoding
# 🏷️ Remaining categorical variables are considered low-cardinality
low_card_cat = list(set(categorical_features) - set(high_card_cat))
# 🎯 Define interaction features (e.g., education × year)
interaction_features = ['ESCOLARIDAD_ACUMULADA', 'ANIO']
# 🔢 Final list of numerical features, including those used for interactions
numerical_features = list(set(numerical_features + interaction_features))
# 🧮 Numerical pipeline: Imputation + Polynomial terms (interactions) + Scaling
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")), # Fill missing values with median
('poly', PolynomialFeatures(degree=2, include_bias=False)), # Add interaction & quadratic terms
('scaler', StandardScaler()) # Standardize features
])
# 🧠 High-cardinality pipeline: Target Encoding (mean encoding)
high_card_pipeline = Pipeline([
('target_encoder', TargetEncoder()) # Encode using mean income
])
# 🔤 Low-cardinality pipeline: Mode Imputation + One-Hot Encoding
low_card_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="most_frequent")), # Fill with most frequent category
('onehot', OneHotEncoder(handle_unknown="ignore")) # Handle categorical dummies
])
# 🔧 Combine all pipelines using ColumnTransformer
full_pipeline = ColumnTransformer([
('num', num_pipeline, numerical_features), # Apply to numeric columns
('low_card', low_card_pipeline, low_card_cat), # Apply to low-card categorical
('high_card', high_card_pipeline, high_card_cat) # Apply to high-card categorical
])
Before fitting any models, we separate the target variable (LOG_INGRESO) from the features, and use the preprocessing pipeline we defined earlier to transform the inputs.
# Separate features and target
X_train = train_data_selected.drop(columns=['LOG_INGRESO'])
y_train = train_data_selected['LOG_INGRESO']
# Fit the full pipeline and transform the data
X_train_prepared = full_pipeline.fit_transform(X_train, y_train)
# Now that the pipeline is fitted, extract feature names
feature_names = full_pipeline.get_feature_names_out()
Regression Models: Training, Evaluation, and Interpretation¶
⚙️ Modeling Workflow for Each Algorithm
For each model (Linear, Ridge, Lasso, Random Forest, Gradient Boosting), we follow the same logical structure:
- Model Definition & Cross-Validation
We define the model and use 5-fold cross-validation to evaluate:
- Mean Squared Error (MSE)
- Variance of prediction errors
- Model Training The model is retrained on the full training set to make use of all available data.
- Feature Importance
We analyze which features drive the model using one or more of the following:
- Coefficients (for linear models)
- Feature importances (for tree-based models)
- SHAP values, which provide a unified measure of each variable’s contribution
- Visualization
We plot:
- Top predictors using coefficient magnitudes or SHAP importance
📊 Model 1: Linear Regression
Linear Regression serves as our baseline model. It assumes a linear relationship between the predictors and the target variable (LOG_INGRESO). It’s simple, interpretable, and useful for understanding how each feature contributes to the prediction.
✅ What Linear Regression does:
- Estimates coefficients for each feature to minimize the sum of squared errors
- Provides a fully interpretable model with directly measurable effect sizes
- Assumes no regularization, which can be problematic in the presence of multicollinearity or many correlated features
# 1. Define the linear regression model
lr_model = LinearRegression()
# 2. Perform 5-fold cross-validation using negative MSE as the scoring metric
mse_scores = cross_val_score(lr_model, X_train_prepared, y_train, cv=5, scoring='neg_mean_squared_error')
# Convert negative MSE scores to positive values
mse_scores = -mse_scores
# Compute the mean and variance of the MSE across folds
mean_mse = mse_scores.mean()
variance_mse = mse_scores.var()
# Print evaluation metrics
print(f"🔁 Cross-validated Mean Squared Error (MSE): {mean_mse:.4f}")
print(f"📉 Variance of MSE: {variance_mse:.6f}")
# 3. Fit the model on the full training set
lr_model.fit(X_train_prepared, y_train)
# 4. Extract model coefficients and corresponding feature names
coefficients = lr_model.coef_
feature_names = full_pipeline.get_feature_names_out()
# Create a DataFrame to store and sort coefficients by absolute value
coef_df = pd.DataFrame({
"Feature": feature_names,
"Coefficient": coefficients
})
coef_df["Abs_Coefficient"] = coef_df["Coefficient"].abs()
coef_df_sorted = coef_df.sort_values("Abs_Coefficient", ascending=False)
# Print the top 15 most impactful features
print("\n📌 Top 15 most impactful features:")
print(coef_df_sorted.head(15))
🔁 Cross-validated Mean Squared Error (MSE): 0.4851
📉 Variance of MSE: 0.001328
📌 Top 15 most impactful features:
Feature Coefficient Abs_Coefficient
7 num__ANIO ESCOLARIDAD_ACUMULADA -21.7900 21.7900
2 num__ESCOLARIDAD_ACUMULADA 21.5968 21.5968
0 num__EDAD 7.2363 7.2363
4 num__EDAD ANIO -6.9485 6.9485
67 low_card__LLAVE_RELIGION_89 -1.7159 1.7159
40 low_card__LLAVE_RELIGION_18 -1.4317 1.4317
66 low_card__LLAVE_RELIGION_88 -1.1395 1.1395
94 high_card__MERCADO_TRABAJO_LOCAL 1.0628 1.0628
59 low_card__LLAVE_RELIGION_63 -0.9080 0.9080
52 low_card__LLAVE_RELIGION_31 -0.8552 0.8552
32 low_card__LLAVE_RELIGION_4 -0.8453 0.8453
76 low_card__LLAVE_RELIGION_213 0.8113 0.8113
62 low_card__LLAVE_RELIGION_79 0.7867 0.7867
72 low_card__LLAVE_RELIGION_99 -0.7764 0.7764
60 low_card__LLAVE_RELIGION_65 0.7648 0.7648
To assess the performance of the linear regression model, we visualize the relationship between the actual and predicted values of log-transformed income.
# ✅ 5. Generate predictions for the training data
y_pred = lr_model.predict(X_train_prepared)
# ✅ 6. Plot Actual vs Predicted values
plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_train, y=y_pred, alpha=0.4)
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--')
plt.xlabel("Actual LOG_INGRESO")
plt.ylabel("Predicted LOG_INGRESO")
plt.title("Actual vs Predicted Values (Linear Regression)")
plt.tight_layout()
plt.show()
To better interpret the results of our linear regression model, we visualize the top features ranked by the magnitude of their coefficients. This allows us to quickly identify which variables have the strongest association with income (positive or negative), and supports our goal of understanding the evolving role of education and other factors.
# Define the number of top features to display
top_n = 15 # You can change this number to show more or fewer features
# Plot the top coefficients as a horizontal bar chart
plt.figure(figsize=(10, 6))
coef_df_sorted.head(top_n).plot(
kind='barh', # Horizontal bar plot
x='Feature',
y='Coefficient',
color='skyblue',
legend=False
)
# Customize plot appearance
plt.title(f"Top {top_n} Coefficients in Linear Regression")
plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.grid(True, axis='x')
plt.tight_layout()
plt.gca().invert_yaxis() # Most important feature at the top
# Show the plot
plt.show()
<Figure size 1000x600 with 0 Axes>
To evaluate whether the importance of education in predicting income has increased over the years, we visualize the relationship between predicted income and years of education, grouped by year.
# Reconstruct the original features if needed
X_train_df = pd.DataFrame(X_train_prepared.toarray(), columns=feature_names)
# Add predictions back to a copy of the original X
X_vis = X_train.copy()
X_vis['Predicted_LOG_INGRESO'] = lr_model.predict(X_train_prepared)
# Select the most recent 5 years for plotting
years_to_plot = sorted(X_vis['ANIO'].unique())[-5:] # Last 5 years
# Create the plot
plt.figure(figsize=(10, 6))
for year in years_to_plot:
subset = X_vis[X_vis['ANIO'] == year]
sns.regplot(
data=subset,
x='ESCOLARIDAD_ACUMULADA',
y='Predicted_LOG_INGRESO',
label=str(year),
scatter=False # Show lines, not dots
)
# Customize the plot
plt.title("Effect of Education on Predicted Income by Year")
plt.xlabel("Years of Education")
plt.ylabel("Predicted Log(Income)")
plt.legend(title="Year")
plt.tight_layout()
plt.show()
To interpret the linear regression model beyond raw coefficients, we apply SHAP (SHapley Additive exPlanations) — a method grounded in cooperative game theory that attributes the impact of each feature on individual predictions.
While coefficients tell us the average direction and strength of a feature, SHAP values provide instance-level explanations, revealing how each feature pushes the prediction higher or lower.
We visualize the top 15 most influential features using:
- A bar plot summarizing average SHAP importance
- A beeswarm plot showing the distribution and direction of each feature’s impact
# ✅ SHAP explainer for Linear Regression
explainer = shap.LinearExplainer(lr_model, X_train_prepared, feature_perturbation="interventional")
# ✅ Compute SHAP values
shap_values = explainer(X_train_prepared)
# ✅ Use feature names in plot
shap_values.feature_names = full_pipeline.get_feature_names_out().tolist()
# SHAP beeswarm with names
shap.plots.beeswarm(shap_values, max_display=15)
# ✅ SHAP bar plot with feature names
shap.plots.bar(shap_values, max_display=15)
/users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/shap/explainers/_linear.py:99: FutureWarning: The feature_perturbation option is now deprecated in favor of using the appropriate masker (maskers.Independent, maskers.Partition or maskers.Impute). warnings.warn(wmsg, FutureWarning)
⚖️ Regularized Linear Models
In high-dimensional settings or when features are strongly correlated, plain linear regression can suffer from overfitting or unstable coefficient estimates. To address this, we apply regularization — a technique that adds a penalty to large model coefficients, helping control model complexity and improve generalization.
Instead of reducing the number of polynomial degrees (as often done in polynomial regression), we constrain the magnitude of the coefficients to prevent over-reliance on any single variable.
We explore two common regularized linear models:
- Lasso Regression (L1 penalty)
- Ridge Regression (L2 penalty)
These models offer a useful trade-off between interpretability and performance, especially when dealing with multicollinearity or large feature spaces.
📊 Model 2: Lasso
Lasso Regression is especially useful when you want to simplify your model and identify the most relevant predictors. Unlike Ridge, Lasso doesn’t just shrink coefficients—it can eliminate them entirely.
✅ What Lasso does:
- Performs L1 regularization, which penalizes the absolute value of coefficients
- Drives some coefficients to exactly zero, effectively performing feature selection
- Helps interpretability by identifying which variables truly matter
- Useful when you expect only a subset of features to be informative
# ✅ Fit Lasso Regression with cross-validation
# We use LassoCV to automatically select the best regularization strength (alpha)
# `max_iter` is increased to ensure convergence for complex datasets
lasso = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso.fit(X_train_prepared, y_train)
# 🔢 Report the best alpha selected during cross-validation
print(f"✅ Best alpha (λ): {lasso.alpha_:.5f}")
# 📊 Evaluate model performance on training set
y_pred = lasso.predict(X_train_prepared)
mse = mean_squared_error(y_train, y_pred) # Mean Squared Error
variance = np.var((y_train - y_pred) ** 2) # Variance of prediction errors
# 📉 Print performance metrics
print(f"📉 Lasso MSE: {mse:.4f}")
print(f"📉 Variance of prediction errors: {variance:.6f}")
/users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:656: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.3497430693141723, tolerance: 1.1200002937775073 model = cd_fast.sparse_enet_coordinate_descent( /users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:656: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.6329926910830181, tolerance: 1.130919567542444 model = cd_fast.sparse_enet_coordinate_descent( /users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:656: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.8582292876199062, tolerance: 1.130919567542444 model = cd_fast.sparse_enet_coordinate_descent( /users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:656: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.501920166414493, tolerance: 1.1276957066945341 model = cd_fast.sparse_enet_coordinate_descent( /users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:656: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.4375839819113025, tolerance: 1.151732078076371 model = cd_fast.sparse_enet_coordinate_descent(
✅ Best alpha (λ): 0.00041 📉 Lasso MSE: 0.4873 📉 Variance of prediction errors: 4.214954
/users/eleves-b/2024/anahi.reyes-miguel/.local/lib/python3.11/site-packages/sklearn/linear_model/_coordinate_descent.py:656: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 8.300917773422952, tolerance: 1.4232477372926446 model = cd_fast.sparse_enet_coordinate_descent(
# 📎 Extract Lasso coefficients and link them to feature names
feature_names = full_pipeline.get_feature_names_out()
lasso_coef_df = pd.DataFrame({
"Feature": feature_names,
"Coefficient": lasso.coef_
})
# 🧮 Count non-zero coefficients (i.e., features retained by Lasso)
nonzero = lasso_coef_df[lasso_coef_df["Coefficient"] != 0]
print(f"🧹 Non-zero coefficients: {len(nonzero)} out of {len(feature_names)}")
🧹 Non-zero coefficients: 37 out of 95
# 📊 Visualize the top 10 non-zero Lasso coefficients
# Sort by absolute magnitude of the coefficients
top = nonzero.reindex(nonzero["Coefficient"].abs().sort_values(ascending=False).index).head(10)
# Create horizontal bar chart
plt.figure(figsize=(8, 5))
plt.barh(top["Feature"], top["Coefficient"], color='salmon')
plt.xlabel("Coefficient Value")
plt.title("Top 10 Most Influential Features (Lasso Regression)")
plt.gca().invert_yaxis() # Highest value at the top
plt.tight_layout()
plt.show()
To better understand how each variable influences the predictions from our Lasso Regression model, we use SHAP (SHapley Additive exPlanations).
# ✅ Select the fitted Lasso model
model = lasso
# 🧠 Create a SHAP Explainer using the linear algorithm (suited for linear models)
explainer = shap.Explainer(model, X_train_prepared, feature_names=feature_names, algorithm="linear")
# 🔍 Compute SHAP values for the training data
shap_values = explainer(X_train_prepared)
# 📊 Beeswarm plot: shows the impact of each feature across all observations
shap.plots.beeswarm(shap_values, max_display=15)
# 📈 Bar plot: ranks features by average absolute SHAP value
shap.plots.bar(shap_values, max_display=15)
📊 Model 3: Ridge
Ridge Regression is a great next step. It’s similar to Lasso, but instead of eliminating features (like Lasso does), it shrinks all coefficients—especially helpful when you have multicollinearity or highly correlated predictors.
✅ What Ridge does:
- Performs L2 regularization, which reduces the size of coefficients but does not force them to zero
- Stabilizes the model when you have many correlated features (like polynomial and interaction terms)
- Is often more stable than Lasso when your predictors are highly collinear
# ✅ Fit Ridge Regression with cross-validated alpha
# RidgeCV automatically selects the best regularization strength (alpha) from a range
alphas = np.logspace(-6, 2, 100) # Explore alpha values from 10^-6 to 10^2
ridge = RidgeCV(alphas=alphas, cv=5)
ridge.fit(X_train_prepared, y_train)
# 🔢 Print selected alpha (regularization parameter)
print(f"✅ Best alpha (λ): {ridge.alpha_:.5f}")
# 📊 Evaluate model performance on training data
y_pred = ridge.predict(X_train_prepared)
mse = mean_squared_error(y_train, y_pred)
variance = np.var((y_train - y_pred) ** 2)
print(f"📉 Ridge MSE: {mse:.4f}")
print(f"📉 Variance of prediction errors: {variance:.6f}")
✅ Best alpha (λ): 0.00002 📉 Ridge MSE: 0.4803 📉 Variance of prediction errors: 4.143986
# 🔍 Extract coefficients and feature names from the fitted Ridge model
ridge_coefficients = ridge.coef_
feature_names = full_pipeline.get_feature_names_out()
# 🧾 Build a DataFrame linking each feature to its corresponding coefficient
coef_df = pd.DataFrame({
"Feature": feature_names,
"Coefficient": ridge_coefficients
})
# 📏 Compute absolute values to assess relative importance
coef_df["Abs_Coefficient"] = coef_df["Coefficient"].abs()
# 📊 Sort by absolute value and display top 15 features
top_ridge_coefs = coef_df.sort_values("Abs_Coefficient", ascending=False).head(15)
# 📋 Display the table
print("📌 Top 15 Features by Ridge Coefficient Magnitude:\n")
print(top_ridge_coefs[["Feature", "Coefficient"]].to_string(index=False))
📌 Top 15 Features by Ridge Coefficient Magnitude:
Feature Coefficient
num__ANIO ESCOLARIDAD_ACUMULADA -21.8273
num__ESCOLARIDAD_ACUMULADA 21.6335
num__EDAD 7.7565
num__EDAD ANIO -7.4702
low_card__LLAVE_RELIGION_89 -1.7222
low_card__LLAVE_RELIGION_18 -1.4275
low_card__LLAVE_RELIGION_88 -1.1312
high_card__MERCADO_TRABAJO_LOCAL 1.0634
low_card__LLAVE_RELIGION_63 -0.8920
low_card__LLAVE_RELIGION_213 0.8561
low_card__LLAVE_RELIGION_31 -0.8448
low_card__LLAVE_RELIGION_4 -0.8364
low_card__LLAVE_RELIGION_65 0.8105
low_card__LLAVE_RELIGION_99 -0.7601
low_card__LLAVE_RELIGION_79 0.7451
# 📊 Bar plot of top 15 coefficients
plt.figure(figsize=(10, 6))
plt.barh(coef_df_sorted["Feature"], coef_df_sorted["Coefficient"])
plt.gca().invert_yaxis()
plt.title("Top 15 Ridge Regression Coefficients")
plt.xlabel("Coefficient Value")
plt.tight_layout()
plt.show()
# ✅ Use the trained Ridge model for interpretation
model = ridge # You can also switch to lasso here if needed
# 🧠 Initialize SHAP Explainer for linear models
# Uses model coefficients and data to compute feature contributions
explainer = shap.Explainer(model, X_train_prepared, feature_names=feature_names, algorithm="linear")
# 📉 Compute SHAP values for all training examples
shap_values = explainer(X_train_prepared)
# 📊 Visualize the most impactful features
# Beeswarm: Shows both magnitude and direction of impact for each feature
shap.plots.beeswarm(shap_values, max_display=15)
# 📊 Alternative: Bar plot showing mean absolute SHAP values
# Good for quick ranking of feature importance
shap.plots.bar(shap_values, max_display=15)
📊 Model 4: Random Forest Regressor
The Random Forest Regressor is a powerful and flexible machine learning algorithm used for predicting continuous outcomes. It belongs to the family of ensemble methods and is especially useful when:
- You suspect nonlinear relationships between features and the target
- You want robust performance without heavy assumptions about the data structure
🔍 How it works:
- A Random Forest builds multiple decision trees on random subsets of both the observations and the features.
- Each tree produces a prediction, and the final output is the average of all tree predictions.
- This averaging process reduces variance and guards against overfitting.
# 🌲 Instantiate a Random Forest Regressor
# We use 100 trees and a fixed random seed for reproducibility
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
# 🔁 Perform 5-fold cross-validation using negative MSE as the scoring metric
# (scikit-learn returns negative values for loss functions, so we convert to positive)
rf_scores = cross_val_score(rf_model, X_train_prepared, y_train, cv=5, scoring='neg_mean_squared_error')
rf_mse = -rf_scores.mean()
rf_var = rf_scores.var()
# 🖨️ Report performance metrics
print(f"🌲 Random Forest MSE: {rf_mse:.4f}")
print(f"📉 Variance of MSE: {rf_var:.6f}")
# ✅ Fit the model on the entire training set
rf_model.fit(X_train_prepared, y_train)
🌲 Random Forest MSE: 0.5289 📉 Variance of MSE: 0.001743
RandomForestRegressor(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(random_state=42)
# 🌟 Extract feature importances from the trained Random Forest model
importances = rf_model.feature_importances_
feature_names = full_pipeline.get_feature_names_out()
# 📋 Create a DataFrame linking feature names to their importance scores
importance_df = pd.DataFrame({
"Feature": feature_names,
"Importance": importances
})
# 🔽 Sort the features by importance (descending order)
importance_df = importance_df.sort_values(by="Importance", ascending=False)
# 🖨️ Display the top 15 most important features
print("📌 Top 15 most important features in Random Forest:")
print(importance_df.head(15))
📌 Top 15 most important features in Random Forest:
Feature Importance
7 num__ANIO ESCOLARIDAD_ACUMULADA 0.2313
94 high_card__MERCADO_TRABAJO_LOCAL 0.2230
5 num__EDAD ESCOLARIDAD_ACUMULADA 0.0942
4 num__EDAD ANIO 0.0643
78 low_card__ANIO_2000 0.0538
0 num__EDAD 0.0347
3 num__EDAD^2 0.0346
83 low_card__LLAVE_SEXO_2 0.0184
82 low_card__LLAVE_SEXO_1 0.0161
92 low_card__LLAVE_SITUACONYUGAL_8 0.0153
85 low_card__LLAVE_SITUACONYUGAL_1 0.0128
8 num__ESCOLARIDAD_ACUMULADA^2 0.0111
16 low_card__LLAVE_TIETRASLADO_TRABAJO_7 0.0108
2 num__ESCOLARIDAD_ACUMULADA 0.0105
93 low_card__LLAVE_SITUACONYUGAL_9 0.0103
# ⚡ Use a representative sample (e.g., 1000 rows)
sample_idx = np.random.choice(X_train_prepared.shape[0], size=1000, replace=False)
# ✅ Ensure dense and numeric format
X_dense = X_train_prepared.toarray() if hasattr(X_train_prepared, "toarray") else X_train_prepared
X_sample = X_dense[sample_idx].astype(np.float64) # Ensure float64 dtype
# 🧠 Create TreeExplainer
explainer = shap.TreeExplainer(rf_model)
# 🧮 Compute SHAP values
shap_values = explainer.shap_values(X_sample)
# 🏷️ Feature names
feature_names = full_pipeline.get_feature_names_out()
# 🐝 Beeswarm plot
shap.summary_plot(shap_values, X_sample, feature_names=feature_names, max_display=15)
# 📊 Bar plot (summary)
shap.summary_plot(shap_values, X_sample, feature_names=feature_names, plot_type="bar", max_display=15)
📊 Model 5: Gradient Boosting Regressor
Gradient Boosting is a powerful ensemble method that builds a strong predictive model by combining many weak learners—typically shallow decision trees. It improves performance by learning from the mistakes of previous trees in a sequential, focused way.
🔍 How it Works:
- Builds Models Sequentially Instead of training one big model, Gradient Boosting trains a series of small trees—each new one focusing on the residuals (errors) of the last.
- Learns from Mistakes Each new tree is designed to correct the errors of the previous one. In essence, the model gradually learns what previous models got wrong.
- Focuses on the Hardest Cases By placing more weight on examples that are harder to predict correctly, the model pays attention to challenging patterns in the data.
- Combines Weak Learners into a Strong One The predictions from each tree are added together, producing a final model that is both accurate and robust.
# Instantiate model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
# Cross-validation
gb_scores = cross_val_score(gb_model, X_train_prepared, y_train, cv=5, scoring='neg_mean_squared_error')
gb_mse = -gb_scores.mean()
gb_var = gb_scores.var()
print(f"🔥 Gradient Boosting MSE: {gb_mse:.4f}")
print(f"📉 Variance of MSE: {gb_var:.6f}")
# Fit full model
gb_model.fit(X_train_prepared, y_train)
🔥 Gradient Boosting MSE: 0.4803 📉 Variance of MSE: 0.001351
GradientBoostingRegressor(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingRegressor(random_state=42)
# Get feature importances from the model
importances = gb_model.feature_importances_
# Get feature names from preprocessing pipeline
feature_names = full_pipeline.get_feature_names_out()
# Create DataFrame with importance
importance_df = pd.DataFrame({
"Feature": feature_names,
"Importance": importances
})
# Sort by importance (descending)
importance_df_sorted = importance_df.sort_values(by="Importance", ascending=False)
# Display top features
print("📌 Top 15 most important features in Gradient Boosting:")
print(importance_df_sorted.head(15))
📌 Top 15 most important features in Gradient Boosting:
Feature Importance
7 num__ANIO ESCOLARIDAD_ACUMULADA 0.3720
94 high_card__MERCADO_TRABAJO_LOCAL 0.1865
5 num__EDAD ESCOLARIDAD_ACUMULADA 0.1296
78 low_card__ANIO_2000 0.0693
82 low_card__LLAVE_SEXO_1 0.0470
6 num__ANIO^2 0.0466
4 num__EDAD ANIO 0.0354
1 num__ANIO 0.0285
83 low_card__LLAVE_SEXO_2 0.0178
81 low_card__ANIO_2020 0.0128
16 low_card__LLAVE_TIETRASLADO_TRABAJO_7 0.0110
20 low_card__LLAVE_PERTEINDIGENA_3 0.0093
93 low_card__LLAVE_SITUACONYUGAL_9 0.0077
0 num__EDAD 0.0040
9 low_card__LLAVE_TIETRASLADO_TRABAJO_0 0.0039
# ⚡ Sample from the training data
sample_idx = np.random.choice(X_train_prepared.shape[0], size=1000, replace=False)
X_sample_sparse = X_train_prepared[sample_idx]
# 🔧 Convert sparse matrix to dense and ensure float format
X_sample = X_sample_sparse.toarray().astype(np.float64)
# 🧠 Create the SHAP explainer (with feature names)
explainer = shap.Explainer(gb_model, X_sample, feature_names=feature_names)
# 📊 Compute SHAP values
shap_values = explainer(X_sample)
# 🐝 SHAP beeswarm plot
shap.plots.beeswarm(shap_values, max_display=15)
# ✅ SHAP bar plot
shap.plots.bar(shap_values, max_display=15)
📊 Summary of Model Performance on Training Data
The following table summarizes the performance metrics (MSE, MAE, R²) for each model evaluated on the training dataset. These metrics provide insights into how well each model fits the data and captures the underlying patterns.
- MSE (Mean Squared Error): Lower values indicate better fit.
- MAE (Mean Absolute Error): Lower values indicate better fit.
- R²: Higher values (closer to 1) indicate a better model fit, with 1 meaning perfect fit.
Here are the results for the models we have trained:
# Dictionary to store the results
results_train = {
'Model': ['Linear Regression', 'Lasso Regression', 'Ridge Regression', 'Random Forest', 'Gradient Boosting'],
'MSE': [],
'MAE': [],
'R²': []
}
# List of fitted models
models = [lr_model, lasso, ridge, rf_model, gb_model]
model_names = ['Linear Regression', 'Lasso Regression', 'Ridge Regression', 'Random Forest', 'Gradient Boosting']
# Iterate through the models and calculate the metrics
for model in models:
y_train_pred = model.predict(X_train_prepared)
mse = mean_squared_error(y_train, y_train_pred)
mae = mean_absolute_error(y_train, y_train_pred)
r2 = r2_score(y_train, y_train_pred)
# Append results to the dictionary
results_train['MSE'].append(mse)
results_train['MAE'].append(mae)
results_train['R²'].append(r2)
# Create a DataFrame to display the results
results_df_train = pd.DataFrame(results_train)
# Print the results
print("Training Data Model Evaluation Metrics:")
print(results_df_train)
Training Data Model Evaluation Metrics:
Model MSE MAE R²
0 Linear Regression 0.4803 0.4874 0.4262
1 Lasso Regression 0.4873 0.4923 0.4179
2 Ridge Regression 0.4803 0.4874 0.4263
3 Random Forest 0.0770 0.1946 0.9080
4 Gradient Boosting 0.4565 0.4752 0.4547
Final Step: Test All Models on the Test Data¶
After evaluating and tuning the models on the training set, it’s crucial to assess their performance on unseen data (the test set). This ensures that the models generalize well to new data and that we are not overfitting to the training data.
In this step, we will test all the models (Linear Regression, Lasso, Ridge, Random Forest, and Gradient Boosting) on the test set and evaluate their performance using the following metrics:
- Mean Squared Error (MSE): Measures the average of the squares of the errors, which tells us how well our model is fitting the data.
- Mean Absolute Error (MAE): Measures the average of the absolute errors, giving us an idea of how far off predictions are from actual values.
- R² (Coefficient of Determination): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
By comparing these metrics across models, we can identify which model performs the best on unseen data.
✅ Step 1: Prepare the Test Data
# Preprocess the test set using the already fitted pipeline
X_test_prepared = full_pipeline.transform(X_test)
✅ Step 2: Establish a Baseline Model
# Predict the mean of y_train for every observation
baseline_pred = np.full_like(y_test, y_train.mean())
# Evaluate metrics for mean baseline model
baseline_mse = mean_squared_error(y_test, baseline_pred)
baseline_mae = mean_absolute_error(y_test, baseline_pred)
baseline_r2 = r2_score(y_test, baseline_pred)
# Print results
print("Baseline Model (Mean Prediction) Performance:")
print(f"MSE: {baseline_mse:.4f}")
print(f"MAE: {baseline_mae:.4f}")
print(f"R²: {baseline_r2:.4f}")
Baseline Model (Mean Prediction) Performance: MSE: 0.8538 MAE: 0.6816 R²: -0.0013
✅ Step 3: Evaluate All Models on the Test Set
# Models you previously trained:
models = [lr_model, lasso, ridge, rf_model, gb_model]
model_names = ['Linear Regression', 'Lasso Regression', 'Ridge Regression', 'Random Forest', 'Gradient Boosting']
# Evaluate models on test data
results = {
'Model': ['Baseline (Mean)'], # Start with baseline
'MSE': [baseline_mse],
'MAE': [baseline_mae],
'R²': [baseline_r2]
}
# Evaluate each trained model
for model, name in zip(models, model_names):
y_pred = model.predict(X_test_prepared)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
results['Model'].append(name)
results['MSE'].append(mse)
results['MAE'].append(mae)
results['R²'].append(r2)
✅ Step 4: Compare the Results
# Convert to DataFrame and sort clearly
results_df = pd.DataFrame(results).sort_values(by='MSE')
print("Final Model Evaluation Metrics:")
print(results_df)
Final Model Evaluation Metrics:
Model MSE MAE R²
1 Linear Regression 0.5035 0.5025 0.4095
3 Ridge Regression 0.5035 0.5025 0.4095
2 Lasso Regression 0.5094 0.5042 0.4026
5 Gradient Boosting 0.5095 0.5045 0.4024
4 Random Forest 0.5453 0.5321 0.3606
0 Baseline (Mean) 0.8538 0.6816 -0.0013